On Checkpoint Latency
Abstract
Checkpointing and rollback is a technique for minimizing the loss of computation in the presence of failures. Two metrics can be used to characterize a checkpointing scheme: (i) checkpoint overhead (the increase in the application's execution time caused by a checkpoint), and (ii) checkpoint latency (the time required to save the checkpoint). For many checkpointing methods, checkpoint latency is larger than checkpoint overhead. This paper evaluates the expression for the "average overhead" of a checkpointing scheme as a function of checkpoint latency and overhead. It is shown that the "average overhead" is much more sensitive to changes in checkpoint overhead than to changes in checkpoint latency. Also, for equidistant checkpoints, the optimal checkpoint interval is shown to be independent of the checkpoint latency.
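The two claims in the abstract can be explored numerically with a small sketch. The model below is an assumption, not quoted from the paper: failures are taken to be exponentially distributed with rate `lam`, a rollback costs `R`, and the expected time to complete one interval of useful work `T` is assumed to be `exp(lam*(L - C + R)) * (exp(lam*(T + C)) - 1) / lam`. All function and parameter names are hypothetical.

```python
import math

def expected_interval_time(T, C, L, R, lam):
    """Expected time to finish one interval of useful work T.

    Assumed model (not taken from the source): exponential failures with
    rate lam, checkpoint overhead C, checkpoint latency L, rollback cost R.
    """
    return math.exp(lam * (L - C + R)) * (math.exp(lam * (T + C)) - 1.0) / lam

def average_overhead(T, C, L, R, lam):
    """Fractional increase in execution time relative to failure-free work T."""
    return expected_interval_time(T, C, L, R, lam) / T - 1.0

def optimal_interval(C, L, R, lam, lo=1.0, hi=1e5, iters=200):
    """Golden-section search for the T in [lo, hi] minimizing average_overhead."""
    phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c = b - phi * (b - a)
        d = a + phi * (b - a)
        if average_overhead(c, C, L, R, lam) < average_overhead(d, C, L, R, lam):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2

# Illustrative parameters only (seconds and failures per second).
lam, R = 1e-4, 0.0
t_short_latency = optimal_interval(C=10.0, L=10.0, R=R, lam=lam)
t_long_latency = optimal_interval(C=10.0, L=100.0, R=R, lam=lam)

# Sensitivity at a fixed interval: raise overhead C by 10 vs. latency L by 10.
T = t_long_latency
base = average_overhead(T, 10.0, 100.0, R, lam)
delta_overhead = average_overhead(T, 20.0, 100.0, R, lam) - base
delta_latency = average_overhead(T, 10.0, 110.0, R, lam) - base
```

Under this assumed model, `L` enters only through a multiplicative factor that does not depend on `T`, so the minimizing interval is the same for any latency, while the same increase applied to `C` moves the average overhead far more than when applied to `L`.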
Similar resources
Impact of Checkpoint Latency on the Optimal Checkpoint Interval and Execution Time
The massive scale of current and next-generation massively parallel processing (MPP) systems presents significant challenges related to fault tolerance. In particular, the standard approach to fault tolerance, application-directed checkpointing, puts an incredible strain on the storage system and the interconnection network. This results in overheads on the application that severely impact perf...
Design and Evaluation of a Low-Latency Checkpointing Scheme for Mobile Computing Systems
Fault-tolerant mobile computing systems have different requirements and restrictions, not taken into account by conventional distributed systems. This paper presents a coordinated checkpointing scheme which reduces the delay involved in a global checkpointing process for mobile systems. A piggyback technique is used to track and record the checkpoint dependency information among processes durin...
On the Viability of Checkpoint Compression for Extreme Scale Fault Tolerance
The increasing size and complexity of high performance computing (HPC) systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future generation systems. In this work, we explore the feasibility of checkpoint data compression...
Another Two-Level Failure Recovery Scheme: Performance
This report deals with the design and evaluation of a "two-level" failure recovery scheme for distributed systems. In our previous work [30, 32], we motivated a "two-level" recovery approach that tolerates the more probable failures with a low overhead, and less probable failures with possibly higher overhead. The two-level approach can achieve a smaller overhead as compared to traditional recov...
A Low-Latency DMR Architecture with Efficient Recovering Scheme Exploiting Simultaneously Copiable SRAM
This paper presents a novel architecture for a fault-tolerant high-performance system using a checkpoint/restart approach with dual modular redundancy (DMR). The proposed architecture can perform low-latency copy with instantaneously copiable SRAM. Furthermore, we can use an instantaneous comparison scheme that has more fault coverage than comparison with a cyclic redundancy check (CRC). Evalua...
Publication date: 1995